ABSTRACT



Systems and methods for managing a plurality of electronically
stored documents in an open document repository employ a one-way
hash function to compute a hash for the stored documents as an
indexing link. A document management index maps an attribute of
an original document stored in the repository to the hash and the
document. A hash-to-location index maps the hash to an address
location of the document in a file system of the repository. The
attribute points to the hash which then points to the location for
linking the attribute to the location.

23 Claims, 3 Drawing Sheets 
INDEXING SYSTEM USING ONE-WAY HASH
FOR DOCUMENT SERVICE

BACKGROUND OF THE INVENTION

This invention pertains to the art of information processing
systems and more particularly to document management
systems for electronically stored documents.

The invention is particularly applicable to open document
repositories for storing large numbers of documents and will
be described with particular reference thereto. However, it
will be appreciated that the invention has broader
applications, such as systems which use various kinds of
indexing schemes, and may be advantageously employed in
such other environments and applications.

There are many known types of indexing schemes for
obtaining access to documents in a repository. Typically
some particular attribute of a document will be employed as
the key or pointer to the location of the document in the file
system. However, as such repositories become more distributed
and openly accessible through a network, the reliability
of the prior known document management systems has
become highly suspect. For example, someone could access
documents in the repository, modify them or move them
without informing the repository index. The resulting losses
in integrity of documents or the errors in the document
management indexes may be unacceptable. Such problems
can present a substantial quandary to document management
groups charged with the responsibilities of maintaining
a reliable open repository. Ideally, one would like to have the
reliability and robustness of a closed document repository to
assure the validity of the repository, but the enormous
customer demand for an open document management system
has presented a substantial need for a system which can
accommodate wide-spread document access, while maintaining
a reliable and valid index and repository. The particular
problems with prior known systems have occurred in
several ways.

When a storage element such as a disk fails, the whole
file system for that disk may need to be restored from a
backup system. At restoration, all the pointers from the
document management index for that system will point to
the wrong address locations. They will point to where the
files were, rather than where they are at restoration.
Accordingly, the entire document management index has to
be completely redone.

For archive systems, it is important to guarantee that any
document that is indexed in the repository will be available
as long as the index is valid. For prior systems, the only way
to do this was to copy every document to a different and
separate archival repository. Typically, tape systems will be
employed that could be stored off-site. The particular problem
with such a system is that whenever a document was to
be archived, then a copy had to be made immediately and the
archive repository then could become as large as the original
file space. The cost of maintaining such a large archive
system is highly undesirable.

Another particular problem concerns controlling the
access to documents for security reasons and to preclude
undesired changes in the documents themselves. The competing
interests between maintaining an open repository and
yet imposing some capability based access scheme to limit
access to the repository have been difficult to resolve. Prior
known access control mechanisms have not proven to be
sufficiently flexible to accommodate a large number of
documents in the repository or a large user base. On the
other hand, it is oftentimes necessary to completely preclude
access to a document, i.e. to make the document unreadable.
Many businesses have a requirement for an ability to be able
to actually destroy all copies of some documents after a
period of time. However, the presence of permanent document
archives also makes this difficult.

The present invention contemplates a new and improved
indexing scheme which overcomes all the above referred to
problems and others to provide an open document management
system that is flexible, reliable and robust, which is
readily adaptable to a plurality of repository systems and
[illegible] providing improved access control systems over
known systems.

BRIEF SUMMARY OF THE INVENTION

The subject invention provides a method and system for
making a link between a document index and an actual
storage location of the document in a file system through one
level of indirection [illegible] from a one-way hash function
processing of the document. Documents typically are indexed
by attribute or content [illegible]. The hash-to-location
[illegible] functions as an intermediate link between the
attribute and [illegible] from a low probability hash function
or a cryptographically secure hash function. The attribute
comprises a user preselected attribute such as title, author,
keyword or content note.

Several advantages result from the invention which all
generally relate to providing improved reliability and security
of repository access and operation such as may be
expected in a "closed" repository, but now are also available
in an "open" repository. In particular, moving and restoring
documents from backup can be done without invalidating
the document management index. The hash-to-location
index can be updated by either an explicit user action or a
background scanning process.

Another advantage is the provision of capability based
access control to particular documents by either the one-way
hash of the original document or a variation computed by
hashing the document with some standard prefix attached,
thus more simply enabling a variety of access control
regimes.

Yet another advantage is that a smaller document archive
can be achieved by using the backup system of the file
system as a safeguard. That is, a background process,
explicit user action, or an action at time of reuse of backup
tapes can detect those documents that are indexed but are no
longer available on disk, and restore them from backup into
a separate archive.

Yet another advantage is the capability to secure a time
stamp for documents merely by using one-way hash functions
as a document key and generating them regularly.

Still another advantage of the invention relates to the
destruction of documents by effectively making the documents
unreadable by encrypting access to the document with
its one-way hash. Thus, to "destroy" a document the operator
need only erase the one-way hash key.

Another important advantage of the invention is that it can
provide an indexing system using a one-way hash for
backup/archive document service in a distributed file system
where a plurality of processing units have a plurality of files
for selective backup or archive. The plurality of files are
intended for communication to a storage in the backup/
archive document service. The system includes a means for
computing one-way hash for any of the files and for determining
if the one-way hash is already included in the
backup/archive document service and if not, the file is stored
and the hash is recorded, but if the file is already included
in the service, then the system can move onto the next of the
files.

Alternatively, a directory of one-way hashes can be compiled
with means for computing a hash or hashes of the
directory, wherein the hash or hashes can also be stored in
the backup/archive document service.

Yet another advantage of the system is implementing it for
synchronizing files between first and second processing
units for replicating files therebetween. In particular, such a
method would designate a first processing unit as a file
repository having a plurality of files and a second processing
unit as a backup service to the file repository. The second
processing unit would compute the one-way hash of any of
the files in the first repository and determine if the one-way
hash is stored therein and if not record the hash and the file
in the second processing unit. If already included, as above,
the backup service can move to the next file.

In accordance with a more limited aspect of the present
invention, the first processing unit can compute a one-way
hash of a directory of files or directories or a hash of the
hashes of the directories, recursively.

Still other advantages and benefits of the invention will
become apparent to those skilled in the art upon a reading
and understanding of the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may take physical form in certain parts and
steps, and arrangements of parts and steps, the preferred
embodiments of which will be described in detail in the
specification and illustrated in the accompanying drawings
which form a part hereof, and wherein:

FIG. 1 is a block diagram of a document management
system formed in accordance with the present invention;

FIG. 2 is a flow chart showing the steps a user would take
to save a file in the document management system of FIG. 1;

FIG. 3 is a flow chart illustrating the steps for retrieving
a file from the system of FIG. 1;

FIG. 4 is a flow chart for a validating process for the
entries in the hash-to-location index of FIG. 1;

FIG. 5 is a block diagram of an archive and backup system
implemented in accordance with the document management
system of FIG. 1;

FIG. 6 is a flow chart illustrating the steps of a backup
process for a single workstation in accordance with the
system of FIG. 5;

FIG. 7 is a flow chart illustrating the steps of an archive
process of the system of FIG. 5; and

FIG. 8 is a flow chart illustrating the steps of retrieval
from archive for the system of FIG. 5.

DETAILED DESCRIPTION OF THE
INVENTION

Referring now to the drawings where the showings are for
purposes of illustrating the preferred embodiments of the
invention only, and not for purposes of limiting same, FIG.
1 shows a document management system 10 intended for
use as an open document repository for electronically storing
a plurality of documents. Typically, a user will access a
document management client (not shown) which will in turn
access the system 10. The document management system 10
is comprised of three components: a document management
index 12, a hash-to-location index 20 and a file system 28.
The central processing unit 18 controls the data communication
between these elements. The document management
index 12 comprises a primary table including a mapping
between an attribute of an original document stored in the
repository and a hash 16 computed from that original
document. Specifically, the hash 16 is obtained by using a
one-way hash function. Such hash functions are well known
and comprise a low probability hash function or
cryptographically secure hash function. All such functions
have the ability to accept an arbitrarily large input and
produce a small, fixed sized output. Accordingly, random
sized documents can be processed by the one-way hash
function to generate an associated hash in the index table 12
so the particular attribute can be uniquely associated with
the corresponding document hash. A good general discussion
of such one-way hash functions is contained in Merkle,
A Fast Software One-Way Hash Function, Journal of Cryptology
3:43-58 (1990).

A secondary table 20 comprises the hash-to-location
index including a mapping between a hash 22 and an address
location 24. The address location 24 comprises the location
of the document as it is electronically stored in the file
system 28. For any particular document that is stored in the
file system 28, the hash of that document will be identical in
the document management index and the hash-to-location
index. Accordingly, when the document is identified by the
document attribute 14, the hash associated with that attribute
will point to the same hash in the hash-to-location index and
the associated location with that hash will then point to the
real address of the document in the file system. The
hash thus functions as a document key which makes the link
between the document index and actual storage location pass
through one level of indirection, the hash-to-location
index table 20, where the appropriate look-up is performed
for the location of the document.

FIG. 2 comprises a flow chart for storing a document in
the system of FIG. 1. At step 30 the user will store the
document at a particular location in the file system 28.
Storage is accomplished in a conventional data processing
manner by the central processing unit 18. The processing
unit 18 will also include a program for computing the hash
at step 32 from a preselected one-way hash function. The
hash and location are next stored at step 34 in the hash-to-location
index 20 to comprise the mapping from the hash 22
to the location 24 for the particular document. The document
management index 12 must next acquire a particular
identifying attribute from the user or the document itself at
step 36. Then finally at step 38 the appropriate
index between the attribute and the hash 16 is stored in the
index 12. Typically the attribute will comprise something as
simple as a title, author or keyword of the original document
although there is no limitation on the form of the attribute.
Accordingly, after implementation of this process, the indices
12 and 20 sequentially link the attribute of the document
to its hash and to a location in the file system 28 where the
document is stored. The steps of storage in the respective
indices can be done in parallel or in reverse order from what
has been described above.

FIG. 3 describes the retrieval of the document from the
system 10. At step 40, a user will query the document
management index to get the particular hash for the desired
document. At step 42, the selected hash is identified to
look-up the desired location in the hash-to-location index 20
and at step 44, the document can then be retrieved from the
selected location.

FIG. 4 shows the process steps for validating the hash-to-location
index 20 (FIG. 1). Step 48 comprises computing
the hash for each file in the file system 28 and step 50
involves storing the hash-to-location mapping in the hash-to-location
index 20. Such validation can preferably occur
when making back-ups of the files in the file system.

In the present invention, the validation process is easy and
convenient to perform, which is particularly important when
the documents in the file system 28 are moved. For example,
if someone were to move a document from one place to
another, it is easy to locate the hash from the hash-to-location
index. Either an explicit user action or a background
scanning process can compute the one-way hash of all the
files of the file system, or merely a sub-set of them, and then
look-up the hashes in the hash-to-location table and update
the associated locations in the table. Thus, in accordance
with the invention, moving files, restoring them from backup
or similar actions will not invalidate the document management
index 12.

Another advantage of the subject invention is the ability
to impose improved capability-based access for the documents
in the system. One of the features of an open
document management system is that it relies on the underlying
file system for security and access control. However,
as noted above, the access control mechanisms of some file
systems are not particularly flexible. The subject invention
makes it possible to use the one-way hash of a document
(either the original hash, or a variation computed by hashing
the document with some standard prefix attached) as the
capability for access to the particular document. The user
can declare that he is willing to give out copies of any
document given the capability for accessing it. This capability
access control can interact with the document in the
process in a way to give a broad range of document access
control regimes with substantial advantages in document
and user authentication.

Other advantages which flow from the system relate to
time stamping the documents where, by using a one-way
hash function as a document key and generating them
regularly, it is easy to regularly generate secure time stamps
for the documents. Also, the ability to apply digital signatures
to each of the documents in the file system is enhanced.
Lastly, effective document "destruction" can occur by
destroying access to the hash. Without some means of
acquiring the hash, the document location is forever lost.

With reference to FIG. 5, the subject invention is illustrated
for implementation with a backup or archive system.
In particular, a modified hash-to-location index 60
not only includes the mapping to the file location 62 of the
document in the file system 64, but also includes a hash to
backup or archive location 66 for a backup system 68 or an
archive system 70. The subject invention provides a substantial
advantage in reduction in storage size or band width
of document archive space. As explained above, when it is
necessary to guarantee that any document that was indexed
will be available as long as the index is valid, the prior art
required copying every document to a separate archive
repository in order to ensure that it would always be available.
Unfortunately, the resulting repository is as large as the
original file space. The subject invention makes it possible
to use the backup system 68 of the file system as a safeguard.
For example, in either a background process, an explicit user
action, or an action at time of reuse of backup tapes, the
system can notice those documents that are indexed but no
longer available on a disk, and restore them from backup
into a separate archive 70. The archive 70 need not be as big
as all documents indexed, but only those indexed but no
longer available in the file system 64.

FIG. 6 is a flow chart of the steps of a model backup
process for a single workstation. The first step 80 comprises
identifying a particular file system to be backed-up. For each
directory or file in the file system, a selected one-way hash
function is used to compute an associated hash 84. At step
86 for each computed hash, all locations of the associated
document are determined, whether in the file system 64, the
backup system 68 or the archive system 70. At step 88, if the
particular document is determined not to be in the backup
system 68, then the document is copied and stored in the
backup system 68. At step 90, the hash-to-location index 60
is marked that the particular document is stored in the
backup system by the mapping between the associated hash
and the backup or archive location 66.

Step 92 is an optional step which is advantageous when
reusing or discarding backup media. Typically, backup
media will be tape that can be stored for a limited period of
time and that can be reused. At step 92 all entries on the
backup media being discarded or reused are erased from the
hash-to-location index 60. This last step provides the
economy of storage space available with the subject invention.
For example, for a file that is no longer in the file
system 64, but is stored in the backup system 68, the hash
to file location map 62 and the hash to backup location 66
can both be erased. However, if the file is desired to be
stored in a separate archive system 70, the hash-to-location
index can still retain a mapping between the hash and an
archive location 66.

FIG. 7 shows the process for archiving documents. At step
100, a user will mark a particular file for archiving. At step
102, the hash for the marked file is obtained from the
document management index, or if not stored therein, the
hash can be computed. At step 104, the hash-to-location
index is then checked to see if the obtained hash is stored
therein, and if so, the hash is marked as wanting archive at
step 106. Step 108 involves checking the entries in the
backup system so that when reusing discarded backup
media, the entries in that media are checked to see if a
particular file wanting archive was properly archived into
archive system 70. If not, then at step 110 the appropriate
file is read-off the backup media and copied to the archive
system 70.

The subject invention thus provides a substantial advantage
in backup of distributed file systems with replication. In
the context of backing-up a distributed file system, even
outside of the context of a document storage and retrieval
system, it is possible to not back-up most of the data. Rather,
each system need only compute the one-way hash of the file
on its disk (file system), and compare the hash against an
archive service to decide whether the data associated with
the particular hash needs to be sent off-line for backup.

FIG. 8 shows the advantages of retrieval from archive for
a capability-based access system like the subject invention.
For file retrieval from archive, at step 112 the associated
hash is obtained from the document management index or
other source. In step 114 a check is made for the hash in the
hash-to-location index 60, since the hash-to-location index
can identify several locations in either the file system 64, the
backup system 68 or the archive system 70 for a document
represented by a particular hash. At step 116, the most
accessible location for retrieving the document is identified
and at step 118, the document is retrieved therefrom.
Accordingly, the advantage provided is that by using a
one-way hash function, it is not necessary for any other
processor access control or authorization. The retrieval
process becomes simpler.
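
The retrieval from archive described for FIG. 8 can be sketched as follows. This is only an illustration under stated assumptions: the specification does not define how the "most accessible" location is chosen, so the ranking in `ACCESS_COST` (file system, then backup, then archive) is assumed, and the names `most_accessible`, `retrieve`, `index` and `stores` are hypothetical.

```python
import hashlib

# Assumption: a lower number means a more accessible kind of storage.
ACCESS_COST = {"file_system": 0, "backup": 1, "archive": 2}

def most_accessible(index: dict, digest: str):
    """Return the (store, address) pair with the lowest access cost, or None."""
    locations = index.get(digest, [])
    if not locations:
        return None
    return min(locations, key=lambda loc: ACCESS_COST[loc[0]])

def retrieve(index: dict, stores: dict, digest: str) -> bytes:
    """Look up the hash, pick the most accessible location, read the data."""
    store, address = most_accessible(index, digest)
    return stores[store][address]

# Hypothetical data: the document survives only on backup and in the archive.
data = b"contract text"
digest = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumed hash choice
index = {digest: [("archive", "vault/17"), ("backup", "tape-4/93")]}
stores = {
    "file_system": {},
    "backup": {"tape-4/93": data},
    "archive": {"vault/17": data},
}
choice = most_accessible(index, digest)
```

In this sketch the hash alone is enough to select and read a copy, which mirrors the point that no separate authorization step is involved in the look-up.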

The invention has been described with reference to the
preferred embodiments. Obviously, modifications and alterations
will occur to others upon the reading and understanding
of this specification. It is my intention to include all such
modifications and alterations insofar as they come within the
scope of the appended claims or the equivalents thereof.

Having thus described my invention, I now claim:

1. An open document repository for electronically storing 
a plurality of documents comprising: 

a file system for electronically storing an original docu- 
ment at an address location; 

a document management index comprising a mapping 
between an attribute of the original document stored in 
the repository and a hash computed from the original 
document; and 

a hash-to-location index interposed as a key link between 
the document management index and the file system 
comprising a mapping between the hash and the 
address location of the original document in the file 
system whereby the attribute points to the hash which 
points to the location for linking the attribute to the 
location for the documents stored in the repository. 

2. The repository as defined in claim 1 wherein the hash 
comprises a computation from a low probability hash func- 
tion. 

3. The repository as defined in claim 1 wherein the hash 
comprises a computation from a cryptographically secure 
hash function. 

4. The repository as defined in claim 1 wherein the
attribute comprises a user preselected attribute of the original
document.

5. The repository as defined in claim 4 wherein the 
attribute comprises one from a group of title, author and 
keyword of the original document.

6. The repository as defined in claim 1 wherein the 
hash-to-location index includes means for linking the hash
to a backup system.

7. The repository as defined in claim 1 wherein the 
hash-to-location index includes means for linking the hash 
to an archive system. 

8. The repository as defined in claim 1 further including 
means for validating a particular hash associated with a 
particular location in the hash-to-location index. 
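
The validation of claim 8, together with the rescanning described for FIG. 4, might look like the sketch below. It is a hedged illustration only: SHA-256 is an assumed stand-in for the one-way hash function, a dictionary stands in for the hash-to-location index, and `file_hash`, `entry_is_valid` and `rescan` are hypothetical names.

```python
import hashlib
import os
import shutil
import tempfile

def file_hash(path: str) -> str:
    # Assumption: SHA-256 stands in for the one-way hash function.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def entry_is_valid(index: dict, digest: str) -> bool:
    """True if the recorded location still holds a file with this hash."""
    path = index.get(digest)
    return path is not None and os.path.isfile(path) and file_hash(path) == digest

def rescan(index: dict, root: str) -> None:
    """Walk the file system and refresh hash -> location for every file found."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            index[file_hash(path)] = path

root = tempfile.mkdtemp()
old_path = os.path.join(root, "report.txt")
with open(old_path, "wb") as f:
    f.write(b"annual report")
digest = file_hash(old_path)
hash_to_location = {digest: old_path}

# Someone moves the file without informing the index.
os.mkdir(os.path.join(root, "moved"))
new_path = os.path.join(root, "moved", "report.txt")
shutil.move(old_path, new_path)
stale = not entry_is_valid(hash_to_location, digest)

# A background scan repairs the location; attribute -> hash entries are untouched.
rescan(hash_to_location, root)
repaired = entry_is_valid(hash_to_location, digest)
```

Only the hash-to-location mapping changes during the rescan, which is why a move or a restore does not disturb the document management index.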

9. A document management system for electronically 
storing a document in a file system comprising: 

index means interposed between a management index and 
the file system for sequentially linking a preselected
attribute of the document to a hash computed from the 
document to a location in the file system where the 
document is stored. 

10. The document management system as claimed in 
claim 9 wherein the hash is computed from either a low 
probability hash function or a cryptographically secure hash 
function. 

11. The document management system as claimed in 
claim 9 wherein the attribute comprises a user selected 
feature of the document suggestive of an identity of the 
document.

12. In an open document repository containing a plurality
of documents, a method of indexing each of the plurality for 
archival document service comprising steps of: 

identifying an attribute of one of the plurality of docu- 
ments to be electronically stored in the repository; 



computing a hash of the one document; 
mapping the hash to the attribute in a document manage- 
ment index; 

identifying an address location of the one document in a 

file system of the repository; and,
mapping the address location to the hash in a hash-to-location
index wherein the hash-to-location is disposed
as a security key link between the document manage- 
ment index and the file system and whereby the 
attribute is used to point to the location through the 
indexes. 

13. A document management method for a document 
repository containing a plurality of documents in a file 
system of the repository comprising steps of: 

storing an original document in the file system at a 
location; 

computing a hash of the document using a one-way hash 
function; 

storing the hash and the location as an associated mapping 
in a hash to location index disposed as a security link 
for an entrance to the file system; 
acquiring an attribute of the document; and, 
storing the attribute and the hash in a document manage- 
ment index whereby the attribute points to the hash and 
the hash points to the location for selective access of the 
document from the file system. 

14. The document management method as claimed in
claim 13 further comprising steps for retrieval of the document
of:

querying the document management index to identify the 

hash for the document; 
using the hash to identify the location from the hash to 

location index; and,
retrieving the document from the location. 
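
The storing and retrieval steps of claims 13 and 14 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: SHA-256 is assumed in place of the unspecified one-way hash function, plain dictionaries stand in for the two indexes, and the names `store_document`, `retrieve_document`, `document_index` and `hash_to_location` are hypothetical.

```python
import hashlib
import os
import tempfile

def one_way_hash(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the one-way hash function.
    return hashlib.sha256(data).hexdigest()

document_index = {}    # attribute -> hash (document management index)
hash_to_location = {}  # hash -> path in the file system

def store_document(path: str, data: bytes, attribute: str) -> str:
    """Store the file, then record hash -> location and attribute -> hash."""
    with open(path, "wb") as f:
        f.write(data)
    digest = one_way_hash(data)
    hash_to_location[digest] = path
    document_index[attribute] = digest
    return digest

def retrieve_document(attribute: str) -> bytes:
    """Follow attribute -> hash -> location and read the file."""
    digest = document_index[attribute]
    path = hash_to_location[digest]
    with open(path, "rb") as f:
        return f.read()

repo = tempfile.mkdtemp()
store_document(os.path.join(repo, "memo.txt"), b"quarterly memo", "title:Memo")
memo = retrieve_document("title:Memo")
```

The attribute never stores a path directly; every look-up passes through the hash, which is the single level of indirection the description refers to.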

15. The document management method as claimed in 
claim 13, wherein a back up system is also provided and
further comprising steps for backup of the repository of: 

identifying a particular file system of the repository; 
for each directory or file in the particular file system 
computing a representative hash from the one way hash 
function; 

for each computed representative hash, determining all 
locations for said each directory or file in the back up 
and file systems from the hash to location index; 
copying said each directory or file to the back up system 

determined not to be at a back up location; and,
marking the hash to location index to indicate that the 
copied directory or files are in the back up system. 

16. The document management method as claimed in 
claim 15 further comprising, when reusing or discarding
backup media, identifying and erasing all entries for the 
media from the hash to location index. 
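
The backup steps of claim 15 and the media reuse step of claim 16 can be illustrated with the sketch below. It rests on several assumptions: the file system and the backup media are modeled as in-memory dictionaries, only files (not directories) are hashed, SHA-256 stands in for the one-way hash function, and `backup_pass` and `retire_media` are hypothetical names.

```python
import hashlib

def one_way_hash(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the one-way hash function.
    return hashlib.sha256(data).hexdigest()

def backup_pass(file_system: dict, index: dict, media: dict, media_id: str) -> list:
    """Copy to backup only those files whose hash has no backup location yet."""
    copied = []
    for path, data in file_system.items():
        digest = one_way_hash(data)
        entry = index.setdefault(digest, {"file": None, "backup": None})
        entry["file"] = path
        if entry["backup"] is None:
            media[digest] = data        # copy the file to the backup media
            entry["backup"] = media_id  # mark the hash-to-location index
            copied.append(path)
    return copied

def retire_media(index: dict, media_id: str) -> None:
    """When a media is reused or discarded, erase its entries from the index."""
    for entry in index.values():
        if entry["backup"] == media_id:
            entry["backup"] = None

# Hypothetical in-memory stand-ins for a file system and one backup tape.
file_system = {"/docs/a.txt": b"alpha", "/docs/b.txt": b"beta"}
index, tape = {}, {}
first = backup_pass(file_system, index, tape, "tape-1")
second = backup_pass(file_system, index, tape, "tape-1")
retire_media(index, "tape-1")
```

The second pass copies nothing because every hash already has a backup location, and retiring the tape clears those locations so that a later pass would copy the files again.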

17. The document management method as claimed in 
claim 15, wherein an archive system is also provided and
further comprising steps for archiving of the repository of: 

marking one file for archive; 
obtaining a hash for the one file; 
determining if the hash is in the hash to location index and

if the hash is marked as archived or backed up; 
marking the hash as wanting archive in the hash to 
location index. 

18. The document management method as claimed in 
claim 17 further comprising steps of retrieval of the original 
document of: 
obtaining the hash;

identifying a most accessible location of the document 
from the hash to location index; and,

retrieving the document from said most accessible loca- 
tion. 

19. The document management method as claimed in 
claim 17 further comprising when reusing backup media, 
checking if any entry on the backup media that is marked as 
wanting backup in the hash to location index is actually 
archived, and if not copying said entry to archive. 

20. An indexing system using a one-way hash for backup/ 
archive document service in a distributed file system includ- 
ing a plurality of processing units each having a plurality of 
files for selective backup or archive, wherein the plurality of 
files are intended for communication to and storage in the
backup/archive document service, comprising: 

means for computing the one-way hash for a one of the 
files in a one of the processing units; and,

means for determining if said one-way hash is already 
included in the backup/archive document service and if 
not, storing the one file and recording said one-way
hash in the backup/archive document service, and if 
already included, computing another one-way hash for 
another of the files. 
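
The determination recited in claim 20 can be sketched as below, where two processing units that hold an identical file send it to the service only once. This is an illustration under assumptions: `BackupService` and `backup_unit` are hypothetical names, the service is an in-memory dictionary, and SHA-256 stands in for the one-way hash.

```python
import hashlib

class BackupService:
    """Hypothetical backup/archive document service keyed by one-way hash."""

    def __init__(self):
        self.store = {}  # hash -> file contents

    def has(self, digest: str) -> bool:
        return digest in self.store

    def put(self, digest: str, data: bytes) -> None:
        self.store[digest] = data

def backup_unit(files: dict, service: BackupService) -> int:
    """Send only files whose hash is not yet recorded; return the count sent."""
    sent = 0
    for data in files.values():
        digest = hashlib.sha256(data).hexdigest()  # assumed one-way hash
        if not service.has(digest):
            service.put(digest, data)
            sent += 1
    return sent

service = BackupService()
unit_a = {"os.bin": b"shared system image", "a.txt": b"notes from A"}
unit_b = {"os.bin": b"shared system image", "b.txt": b"notes from B"}
sent_a = backup_unit(unit_a, service)
sent_b = backup_unit(unit_b, service)
```

Because the second unit's copy of `os.bin` hashes to a value the service already holds, only its unique file is transferred.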

21. The indexing system as defined in claim 20 including 
a directory of one-way hashes respectively associated with 
a plurality of files wherein said one processing unit includes 
means for computing a hash-of-hashes of said directory
recursively, and means for storing said hash-of-hashes in the 
backup/archive document service. 
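
One way to compute a recursive hash-of-hashes of a directory, as recited in claim 21, is sketched below. The construction is an assumption and is not taken from the specification: each directory hash covers the sorted names and hashes of its children, SHA-256 stands in for the one-way hash, and `tree_hash` is a hypothetical name.

```python
import hashlib
import os
import tempfile

def tree_hash(path: str) -> str:
    """Hash a file's bytes, or a directory's sorted (name, child hash) pairs."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    h = hashlib.sha256()
    for name in sorted(os.listdir(path)):
        child = tree_hash(os.path.join(path, name))  # recurse into children
        h.update(name.encode("utf-8") + b"\0" + child.encode("ascii") + b"\n")
    return h.hexdigest()

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
with open(os.path.join(root, "sub", "x.txt"), "wb") as f:
    f.write(b"one")
before = tree_hash(root)
unchanged = tree_hash(root)

# Changing one nested file changes the hash of every enclosing directory.
with open(os.path.join(root, "sub", "x.txt"), "wb") as f:
    f.write(b"two")
after = tree_hash(root)
```

Storing the top-level hash lets a service tell from a single comparison whether anything under the directory differs from what it already holds.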

22. A method for synchronizing files between first and
second processing units and for replicating files
therebetween, comprising:

designating the first processing unit as a file repository
having a plurality of files;

designating the second processing unit as a backup service
to the file repository;

computing by the second processing unit of a one-way
hash of a one of the files in the first repository; and,

determining if said one-way hash is already stored in the
second processing unit, and if not, storing the one file
and recording the one-way hash in the second processing
unit, and if already included, calling another of the
files of the first processing unit for said computing and
determining.

23. The method as defined in claim 22 wherein said first 
processing unit includes a directory of the files stored therein 
and said second processing unit further selectively computes 
another one-way hash of said directory or a directory of files
and directories, or a hash-of-hashes of said directory,
recursively.

* * * * *